Bioinformatics A Practical Guide to Next Generation Sequencing Data Analysis (Hamid D. Ismail)

Mapping of Sequence Reads to the Reference Genomes ◾ 57

$ 10

A $ 9

GA $ 8

GGA $ 7

TGGA $ 6

CTGGA $ 5

GCTGGA $ 4

GGCTGGA $ 3

TGGCTGGA $ 2

TTGGCTGGA $ 1

CTTGGCTGGA $ 0

Notice that each line includes a suffix (key) and a position (value). Then from the key-value

pairs, we can construct a tree made up of nodes and edges. The positions (values) will be

the nodes and suffixes (keys) will be the edges of the tree. The suffix tree is built as shown

in Figure 2.5, starting from the first suffix on the top and it moves down to make branched

nodes and edges to avoid repeating common characters. This way, we will construct a

suffix tree with nodes and edges. An entire reference genome can be divided into suffixes

and stored this way with both suffixes and indexed positions in the unbranched nodes so

finding a pattern or a position of a read in the reference genome will be easy.

Once the reference sequence is indexed using the suffix tree, one of several searching

algorithms can be used to find the location where a read maps. For instance, to find “TGG”

in Figure 2.5, we will start searching from the root looking for “T”, and from the next node,

we will look for “GG”; thus, since there are two leaf nodes with the indexes 2 and 6, that

means “TGG” is aligned to the reference sequence in the positions 2 and 6 as shown by the

red color (Figure 2.5).

2.2.3 Suffix Arrays

The suffix array (SA) is similar to the suffix tree for pattern matching and finding sub-

strings (reads) and it can be constructed from the suffix tree. It is basically a sorted array

FIGURE 2.5 Nodes and edges of the suffix tree.